- Simple Linear regression
- Multiple Linear regression
- Qualitative predictors (dummy variables)
- Extensions: interactions, nonlinear effects
- Potential issues: outliers and assumption verification (residual plots and transformations)
9/10/2020
Linear regression is a simple approach to the generic supervised learning problem \[Y = f(X_1,X_2,\dots,X_p) + \varepsilon.\] It assumes that the dependence of \(Y\) on \(X_1,X_2,\dots,X_p\) is linear; it is therefore a parametric model.
We can write down \(f(\mathbf{X})\) in the form of an equation: \[y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon = \mathbf{x}^\intercal \cdot \boldsymbol{\beta} + \varepsilon,\] identically and independently for every data point, where \(\mathbf{x} = (1,x_1,x_2,\dots,x_p)\) and \(\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \dots, \beta_p)\).
A notational convenience: the intercept gets absorbed into the vector of predictors by including a leading term of \(1\).
Design matrix \(\mathbf{X}\):
\[\mathbf{X} = \left[ \begin{array}{cccc} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{array} \right], \quad \boldsymbol{\beta} = \left[ \begin{array}{c} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{array} \right], \quad \mathbf{y} = \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right], \quad \boldsymbol{\varepsilon} = \left[ \begin{array}{c} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{n} \end{array} \right]\]
\[y_{i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_{i} \quad \forall i = 1, \dots, n\] \[\mathbf{y} = \mathbf{X} \cdot \boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
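In R, the design matrix with the leading column of \(1\)s is what model.matrix() constructs from a formula; a tiny sketch with two made-up predictors (the values are arbitrary):

```r
# Build the design matrix X for two hypothetical predictors x1 and x2
x1 <- c(1.2, 0.7, 3.1)
x2 <- c(10, 15, 12)
model.matrix(~ x1 + x2)   # first column is the intercept column of 1s
```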
Assumptions:

1. Linearity: \(E(Y \mid X_1, \dots, X_p) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p\), i.e. the errors have mean zero.
2. The errors are independent.
3. The errors have constant variance \(\sigma^2\) (homoskedasticity).
4. The errors are normally distributed.

Together, (2)+(3)+(4) (with the zero mean from (1)) say that
\[\varepsilon_1, \dots, \varepsilon_n \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\]
We assume the model \(Y = \beta_0 + \beta_1 x + \varepsilon\), where \(\beta_0\) and \(\beta_1\) are two unknown parameters representing the intercept and slope, and \(\varepsilon\) is the error term.
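As an illustration (this is not the data behind the regression output shown later), here is a minimal R sketch that simulates observations from this model under assumptions (1)-(4); the values of \(\beta_0\), \(\beta_1\), \(\sigma\) and the range of \(x\) are arbitrary choices.

```r
# Simulate n observations from y = beta0 + beta1 * x + eps, eps ~ iid N(0, sigma^2)
set.seed(1)
n     <- 20
beta0 <- 40                              # assumed intercept
beta1 <- 45                              # assumed slope
sigma <- 10                              # assumed error standard deviation
x     <- runif(n, min = 1, max = 3.5)    # e.g. house size in 1000s of square feet
eps   <- rnorm(n, mean = 0, sd = sigma)  # independent, constant-variance, normal errors
y     <- beta0 + beta1 * x + eps         # linearity
```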
Given some estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the model coefficients, we predict future outcomes using \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\] where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of \(X = x\).
A reasonable way to fit a line is to minimize the amount by which the fitted value differs from the actual value. This amount is called the residual. Then \(e_i = y_i - \hat{y}_i\) represents the \(i^{th}\) residual.
Ideally we want to minimize the size of all residuals:
We define the residual sum of squares (RSS) as \[\begin{aligned} RSS &= e_1^2 + e_2^2 + \dots + e_n^2 \\ &= (y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (y_2 - \hat{\beta}_0 - \hat{\beta}_1 x_2)^2 + \dots + (y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n)^2 \end{aligned}\]
The values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the RSS are
\[\hat{\beta}_1 = r_{xy} \frac{s_y}{s_x}, \quad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}\]
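Using the sample quantities defined just below (mean, standard deviation, correlation), here is a quick R check that these closed-form estimates match the output of lm(); x and y are the simulated data from the earlier sketch.

```r
# Closed-form least squares estimates vs. lm() (simulated x and y from above)
b1_hat <- cor(x, y) * sd(y) / sd(x)   # slope: r_xy * s_y / s_x
b0_hat <- mean(y) - b1_hat * mean(x)  # intercept: y_bar - b1_hat * x_bar

fit <- lm(y ~ x)
c(b0_hat, b1_hat)
coef(fit)                             # identical (up to rounding)
```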
Sample mean: measure of centrality \[\bar{y} = \dfrac{1}{n} \sum_{i=1}^n y_i\]
Sample variance: measure of spread \[s_y^2 = \dfrac{1}{n-1} \sum_{i=1}^n (y_i - \overline{y})^2\]
Sample standard deviation \[s_y = \sqrt{s_y^2}\]
Sample covariance: measure of linear association \[\text{Cov}(Y, X) = \text{Cov}(X, Y) = \dfrac{\sum_{i=1}^n (y_i - \overline{y})(x_i - \overline{x})}{n - 1}\]
The correlation \(r_{xy}\) is the standardized covariance: \[r_{xy} = \text{corr}(Y, X) = \dfrac{\text{Cov}(Y, X)}{\sqrt{s_x^2 s_y^2}} = \dfrac{\text{Cov}(Y, X)}{s_x s_y}\]
The correlation is scale invariant and the units of measurement do not matter: it is always true that \(-1 \leq \text{corr}(Y, X) \leq 1\).
This gives the direction (- or +) and strength (0 \(\rightarrow\) 1) of the linear relationship between \(X\) and \(Y\).
Correlation only measures linear relationships: \(\text{corr}(X, Y) = 0\) does not mean the variables are unrelated!
Also be careful with influential observations.
For the same reason, the linear regression cannot be interpreted in a causal way.
Remember that each observation splits into a fitted value plus a residual, \(y_i = \hat{y}_i + e_i\). Since the fitted values and the residuals are uncorrelated, i.e. \(\text{corr}(\widehat{Y}, e) = 0\),
\[ \begin{aligned} \text{Var}(Y) &= \text{Var}(\widehat{Y} + e) = \text{Var}(\widehat{Y}) + \text{Var}(e) \\ \sum_{i=1}^n (y_i - \overline{y})^2 &= \sum_{i=1}^n (\hat{y}_i - \overline{\hat{y}})^2 + \sum_{i=1}^n (e_i - \overline{e})^2 \end{aligned} \]
Given that \(\overline{e} = 0\) and \(\overline{\hat{y}} = \overline{y}\) (why?), we get \[\underbrace{\sum_{i=1}^n (y_i - \overline{y})^2}_{\substack{\text{Total Sum of} \\ \text{Squares (TSS)}}} = \underbrace{\sum_{i=1}^n (\hat{y}_i - \overline{y})^2}_{\substack{\text{Model Sum of} \\ \text{Squares (MSS)}}} + \underbrace{\sum_{i=1}^n e_i^2}_{\substack{\text{Residual Sum of} \\ \text{Squares (RSS)}}}\]
The coefficient of determination, denoted by \(R^2\), measures goodness of fit:
\[R^2 = \dfrac{MSS}{TSS} = \dfrac{TSS-RSS}{TSS} = 1 - \dfrac{RSS}{TSS}\]
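A short sketch verifying the sums-of-squares decomposition and the value of \(R^2\) on the simulated fit from above:

```r
# Check TSS = MSS + RSS and R^2 = 1 - RSS/TSS (simulated x, y and fit from above)
TSS <- sum((y - mean(y))^2)
MSS <- sum((fitted(fit) - mean(y))^2)
RSS <- sum(residuals(fit)^2)
all.equal(TSS, MSS + RSS)                  # decomposition holds
c(manual = 1 - RSS / TSS,
  lm     = summary(fit)$r.squared)         # the two values agree
```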
The standard error of an estimator reflects how it varies under repeated sampling. We have (large \(n\) approximation): \[ \begin{aligned} &\text{SE}(\hat{\beta}_1)^2 = \sigma^2 \dfrac{1}{\sum_{i=1}^n (x_i - \overline{x})^2} \approx s^2 \dfrac{1}{\sum_{i=1}^n (x_i - \overline{x})^2} \\ &\text{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \dfrac{1}{n} + \dfrac{\overline{x}^2}{\sum_{i=1}^n (x_i - \overline{x})^2} \right] \approx s^2 \left[ \dfrac{1}{n} + \dfrac{\overline{x}^2}{\sum_{i=1}^n (x_i - \overline{x})^2} \right] \end{aligned} \] where \(\sigma^2 = \text{Var}(\varepsilon)\) is estimated via \(s^2\) (more on this to come).
These standard errors can be used to compute confidence intervals (CI). An approximate \(95\%\) confidence interval for \(\beta_1\) is \[\text{CI} = \hat{\beta}_1 \pm 2\, \text{SE}(\hat{\beta}_1)\]
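A sketch computing \(\text{SE}(\hat{\beta}_1)\) and the approximate interval by hand on the simulated fit, and comparing with confint(), which uses the exact \(t\) quantile instead of \(2\):

```r
# SE of the slope and an approximate 95% CI (simulated x, y and fit from above)
s2    <- sum(residuals(fit)^2) / (length(y) - 2)   # estimate of sigma^2
se_b1 <- sqrt(s2 / sum((x - mean(x))^2))
b1    <- unname(coef(fit)["x"])
c(lower = b1 - 2 * se_b1, upper = b1 + 2 * se_b1)  # beta1_hat +/- 2 SE
confint(fit, "x", level = 0.95)                    # exact t-based interval
```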
Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis \(H_0: \beta_1 = 0\) (there is no relationship between \(X\) and \(Y\)) against the alternative \(H_A: \beta_1 \neq 0\) (there is some relationship between \(X\) and \(Y\)).
If \(\beta_1 = 0\), then the model reduces to \(Y = \beta_0 + \varepsilon\), and \(X\) is not associated with \(Y\).
To test the null hypothesis, we compute a \(t\)-statistic, given by \[t = \dfrac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)}\]
This will have a \(t\)-distribution with \(n-2\) degrees of freedom, assuming \(\beta_1 = 0\).
Using statistical software, it is easy to compute the probability of observing any value equal to \(|t|\) or larger. We call this probability the p-value.
This is equivalent to using the confidence interval: a value of \(|t|\) greater than \(1.96\) implies that the hypothesized value lies outside the \(95\%\) confidence interval… therefore reject!
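The same test done by hand in R on the simulated fit; the result matches the row for x in summary(fit):

```r
# t-statistic and two-sided p-value for H0: beta1 = 0 (simulated fit from above)
se_b1  <- summary(fit)$coefficients["x", "Std. Error"]
t_stat <- unname(coef(fit)["x"]) / se_b1
p_val  <- 2 * pt(abs(t_stat), df = length(y) - 2, lower.tail = FALSE)
c(t = t_stat, p.value = p_val)
summary(fit)$coefficients["x", ]   # same t value and Pr(>|t|)
```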
Suppose in testing \(H_0: \beta_1 = 1\) you got a t-stat of \(6\) and the confidence interval was \([1.00001, 1.00002]\).
Now, suppose in testing \(H_0: \beta_1 = 1\) you got a t-stat of \(-0.02\) and the confidence interval was \([-100, 100]\).
There are two things that we want to know: what value of \(Y\) to predict for a given \(X\), and how much uncertainty surrounds that prediction.
Goal: measure the accuracy of our forecasts, or how much uncertainty there is in the forecast. One method is to specify a range of \(Y\) values that are likely, given an \(X\) value.
Prediction Interval: probable range for \(Y\)-values given \(X\).
Key Insight: to construct a prediction interval, we will have to assess the likely range of error values corresponding to a \(Y\) value that has not yet been observed!
You are told (without looking at the data) that \(\beta_0 = 40\), \(\beta_1 = 45\), \(\sigma = 10\), and you are asked to predict the price of a \(1500\) square foot house (so \(x = 1.5\), with size measured in thousands of square feet).
What do you know about \(Y\) from the model?
\[Y = 40 + 45 (1.5) + \varepsilon = 107.5 + \varepsilon\]
Thus our prediction for price is \(Y \mid (X = 1.5) \sim \mathcal{N}(107.5, 10^2)\) and a \(95\%\) Prediction Interval for \(Y\) is \(87.5 < Y < 127.5\).
The conditional distribution for Y given X is Normal: \[Y \mid (X = x) \sim \mathcal{N}(\beta_0 + \beta_1 x ,\sigma^2)\]
We estimate \(\sigma^2\) with: \[s^2 = \dfrac{1}{n-2} \sum_{i=1}^n e_i^2 = \dfrac{RSS}{n-2}\] (the \(2\) is the number of estimated regression coefficients, \(\beta_0\) and \(\beta_1\)).
We usually use \(s = \sqrt{RSS/(n-2)}\), in the same units as \(Y\). It is also called the regression standard error.
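A sketch on the simulated fit: estimate \(s\) by hand, check it against sigma(), and ask predict() for a \(95\%\) prediction interval at \(x = 1.5\); because the coefficients are estimated rather than known, the interval is a bit wider than the \(\pm 2\sigma\) interval above.

```r
# Regression standard error s and a 95% prediction interval at x = 1.5
# (simulated fit from above)
s <- sqrt(sum(residuals(fit)^2) / (length(y) - 2))
c(manual = s, lm = sigma(fit))             # sigma() extracts s from the fit
predict(fit, newdata = data.frame(x = 1.5),
        interval = "prediction", level = 0.95)
```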
```r
fit <- lm(y ~ x)
summary(fit)
```

```
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.830 -11.178   2.211   9.141  20.217 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   38.728      8.938   4.333 0.000401 ***
## x             44.069      3.724  11.833 6.32e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.72 on 18 degrees of freedom
## Multiple R-squared:  0.8861, Adjusted R-squared:  0.8798 
## F-statistic:   140 on 1 and 18 DF,  p-value: 6.324e-10
```
Questions we might ask:
Here our model is \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon\] We interpret \(\beta_j\) as the average effect on \(Y\) of a one unit increase in \(X_j\) , holding all other predictors fixed. But predictors usually change together!
In the advertising example, the model becomes \[\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \varepsilon\]
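Assuming the advertising data have been loaded as a data frame called Advertising with columns sales, TV, radio and newspaper (the name and the loading step are not shown in these notes), the fit would look like:

```r
# Multiple regression for the advertising example
# (assumes a data frame `Advertising` with columns sales, TV, radio, newspaper)
ad_fit <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(ad_fit)
```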
The ideal scenario is when the predictors are uncorrelated - a balanced design:
Correlations amongst predictors cause problems:
A useful way to plot the results for the multiple linear regression is to look at \(Y\) (observed values) against \(\hat{Y}\) (fitted values).
If things are working, these values should form a nice straight line. Can you guess the slope of the line?
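For example, a quick sketch of this diagnostic plot for the (hypothetical) advertising fit above:

```r
# Observed response vs. fitted values; a good fit hugs the reference line
plot(fitted(ad_fit), Advertising$sales,
     xlab = "Fitted values", ylab = "Observed sales")
abline(a = 0, b = 1, lty = 2)   # line through the origin with slope 1
```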
Intervals and t-statistics are exactly the same as in the simple linear regression.
Important: intervals and testing via \(\hat{\beta}_j\) and \(\text{SE}(\hat{\beta}_j)\) are one-at-a-time procedures: we are evaluating the \(j^{th}\) coefficient conditional on the other \(X\)’s being in the model, but regardless of the values we have estimated for the other \(\hat{\beta}\)’s.
To test whether at least one predictor is useful, i.e. \(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\), we can use the F-statistic:
\[F = \dfrac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p,n-p-1}\] Note that we are not testing the significance of the intercept!
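A sketch computing the F-statistic by hand for the hypothetical advertising fit and checking it against summary():

```r
# Overall F-test of H0: beta_1 = ... = beta_p = 0 (hypothetical advertising fit)
p   <- 3                               # TV, radio, newspaper
n   <- nrow(Advertising)
TSS <- sum((Advertising$sales - mean(Advertising$sales))^2)
RSS <- sum(residuals(ad_fit)^2)
((TSS - RSS) / p) / (RSS / (n - p - 1))
summary(ad_fit)$fstatistic             # value, numdf, dendf
```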
Especially in high dimensions, it is very unlikely that all of the predictors are relevant to explain the outcome variable.
Usually, predictive accuracy (and interpretability) can be improved by selecting an “optimal” subset of predictors.
More to come on this…
We have data about monthly sales of a certain item and its corresponding price (P1) from a vendor.
It looks like we should just raise our prices, right?
The regression equation for Sales on own price is: \[\text{Sales} = 985.5 + 66.5 P1\]
If we now add the competitor's price (P2) to the regression, we get \[\text{Sales} = 98.8 - 61.7 P1 + 140.9 P2\]
Does this look better? How did it happen?
Remember: -61.7 is the effect on sales of a change in P1 with P2 held fixed!
Let us look at a subset of points where P1 varies and P2 is held approximately constant.
For a fixed level of P2, variation in P1 is negatively correlated with Sales!
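The sales data themselves are not reproduced here, so the sketch below simulates a situation with the same flavor: P1 and P2 move together, but for fixed P2 a higher P1 lowers Sales. The coefficient on P1 flips sign once P2 enters the model, just as in the fitted equations above (all numbers are made up).

```r
# Illustrative simulation (not the course data): own price P1 tracks the
# competitor's price P2, but conditionally on P2, higher P1 means lower Sales.
set.seed(2)
m     <- 100
P2    <- runif(m, min = 1, max = 5)
P1    <- 0.8 * P2 + rnorm(m, sd = 0.3)           # prices move together
Sales <- 100 - 60 * P1 + 140 * P2 + rnorm(m, sd = 20)

coef(lm(Sales ~ P1))        # marginal slope on P1: positive (misleading)
coef(lm(Sales ~ P1 + P2))   # conditional slope on P1: negative
```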